-
Notifications
You must be signed in to change notification settings - Fork 1.9k
in_calyptia_fleet: Only commit fleet config when successfully reloaded #10874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
in_calyptia_fleet: Only commit fleet config when successfully reloaded #10874
Conversation
Signed-off-by: Alec Holmes <[email protected]>
WalkthroughAdds a hot-reload success flag to the global config, sets it during successful reloads, and integrates it into the Calyptia Fleet input plugin to manage config refs with commit/rollback logic. Refactors fleet config existence checks, adjusts add/commit/rollback flows, and cleans up uncommitted refs during init and collection. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Fleet as Calyptia Fleet Plugin
participant Core as Fluent Bit Core
participant Cfg as flb_config
participant Refs as Ref Files (new/cur/old)
Note over Fleet,Refs: Add/update config
Fleet->>Refs: calyptia_config_add(cfgname)\n- set "new" -> cfgname\n- reparent "cur" -> "old"\n- remove mismatched "new"
Fleet->>Core: trigger reload (path from "cur" or "old")
Fleet->>Cfg: set hot_reload_succeeded = FLB_FALSE
Note over Core,Cfg: Reload processing
Core->>Core: perform hot reload
Core->>Cfg: set hot_reload_succeeded = FLB_TRUE
Note over Fleet,Refs: Collection cycle
Fleet->>Cfg: check hot_reload_succeeded
alt success
Fleet->>Refs: commit: "new" -> "cur", remove "old"/"new"
else no reload or failure
Fleet->>Refs: rollback if needed:\n- delete "new"\n- "old" -> "cur"\n- remove "old"
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks (2 passed, 1 warning)❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Poem
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).Please share your feedback with us on this Discord post. ✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/flb_reload.c (1)
454-459: Also clear the success flag on the old context at reload start to avoid stale state.If a reload attempt aborts, the active (old) context may still carry
hot_reload_succeeded = TRUEfrom a prior success. Clearing it when marking hot reload prevents ambiguous state for any consumer not setting it explicitly (your Fleet plugin does, but making reload self-contained is safer).Diff suggestion:
- /* Mark hot reloading for old ctx to prevent duplicated request via HTTP */ - old_config->hot_reloading = FLB_TRUE; + /* Mark hot reloading for old ctx to prevent duplicated request via HTTP */ + old_config->hot_reloading = FLB_TRUE; + /* Clear stale success state on the old context */ + old_config->hot_reload_succeeded = FLB_FALSE; ... - new_config->hot_reloading = FLB_FALSE; + new_config->hot_reloading = FLB_FALSE; + /* Mark success */ new_config->hot_reload_succeeded = FLB_TRUE;Also applies to: 551-553
plugins/in_calyptia_fleet/in_calyptia_fleet.c (1)
1490-1563: Validate dereferenced ref paths before recursive delete to prevent path traversal/file clobber.
calyptia_config_delete_by_ref()dereferences a.reffile and then deletes all matches for its prefix (/path/to/file*). If an attacker tampers with the.reffile (or it gets corrupted), this can unlink arbitrary files/directories outside the Fleet cache. Before deleting:
- Canonicalize both the base Fleet dir and the dereferenced path (e.g.,
realpath()on POSIX;GetFullPathNameA()on Windows),- Verify the dereferenced path is within the base dir, and
- Refuse to operate if the check fails.
Concrete patch (POSIX path check; add equivalent for Windows):
static int calyptia_config_delete_by_ref(struct flb_in_calyptia_fleet_config *ctx, const char *ref_name) { - struct cfl_array *confs; + struct cfl_array *confs; flb_sds_t config_path; + flb_sds_t base_dir = NULL; + char canon_base[PATH_MAX]; + char canon_target[PATH_MAX]; char *ext; int idx; struct stat entry_stat; const char *entry_path; @@ - config_path = fleet_config_deref(ctx, ref_name); + config_path = fleet_config_deref(ctx, ref_name); if (config_path == NULL) { return FLB_FALSE; } + + /* Constrain deletes to the Fleet base directory */ + base_dir = generate_base_fleet_directory(ctx); + if (base_dir == NULL) { + flb_sds_destroy(config_path); + return FLB_FALSE; + } +#ifndef _WIN32 + if (realpath(base_dir, canon_base) == NULL || + realpath(config_path, canon_target) == NULL) { + flb_plg_error(ctx->ins, "failed to canonicalize paths for delete safety"); + flb_sds_destroy(base_dir); + flb_sds_destroy(config_path); + return FLB_FALSE; + } + if (strncmp(canon_target, canon_base, strlen(canon_base)) != 0 || + (canon_target[strlen(canon_base)] != '/' && canon_target[strlen(canon_base)] != '\0')) { + flb_plg_error(ctx->ins, "ref path %s not under base dir %s", canon_target, canon_base); + flb_sds_destroy(base_dir); + flb_sds_destroy(config_path); + return FLB_FALSE; + } +#else + /* TODO: Add GetFullPathNameA() + prefix check on Windows */ +#endif + flb_sds_destroy(base_dir);
🧹 Nitpick comments (7)
plugins/in_calyptia_fleet/in_calyptia_fleet.c (7)
621-630: Minor: free order/readability in reload thread.You log
reload->cfg_paththenflb_free(reload). Safe, sincecfg_pathownership is transferred earlier, but consider storingcfg_pathin a local before the free for clarity:- flb_info("reloading configuration from path: %s", reload->cfg_path); - flb_free(reload); + flb_info("reloading configuration from path: %s", reload->cfg_path); + flb_free(reload);
1566-1613: Add error handling around unlink of cur ref; preserve ENOENT, log others.Currently
unlink(cur_ref_filename);ignores failures. Recommend ignoringENOENTbut logging other errors to aid debugging.- unlink(cur_ref_filename); + if (unlink(cur_ref_filename) != 0 && errno != ENOENT) { + flb_plg_warn(ctx->ins, "failed to delete current ref file: %s (errno=%d)", + cur_ref_filename, errno); + }
1621-1667: Commit flow looks correct; optionally reset success flag post-commit.Once committed,
hot_reload_succeededremains TRUE in the active config indefinitely. Not harmful (commit re-checks absence of "new"), but you could clear it to explicitly record “no pending commit”.flb_plg_info(ctx->ins, "committed new config: %s", config_path); flb_sds_destroy(config_path); + /* Optional: clear success flag now that commit has been processed */ + ctx->ins->config->hot_reload_succeeded = FLB_FALSE;
1670-1709: Rollback flow OK; consider logging when no 'new' exists to trim noise.If
exists_fleet_config(ctx, "new") == FLB_TRUEfails often, the log volume can be high. Downgrade to debug when nothing to delete. No functional change required.
1719-1754: Commit-on-success guard is solid; add one more safety: ensure timestamped path.You already check
is_new_fleet_config(ctx, config). As an extra guard against ref tampering, verify the currentconf_path_fileis under the Fleet base dir (same canonicalization as delete). Optional.
2037-2060: Do not hard-fail collection solely on commit failure.
FLB_INPUT_RETURN(-1)may disable the collector after a transient filesystem hiccup during commit. Consider logging and retrying on the next tick to keep the agent ingesting with the current config.- if (commit_config_if_reloaded(ctx) == FLB_FALSE) { - FLB_INPUT_RETURN(-1); - } + if (commit_config_if_reloaded(ctx) == FLB_FALSE) { + flb_plg_error(ctx->ins, "commit of reloaded configuration failed; will retry"); + FLB_INPUT_RETURN(0); + }
1097-1158: Efficient timestamp check; avoid zero-capacity SDS.
glob_pattern = flb_sds_create_size(0);works, but using an initial capacity nearbase_dir + "/*.yaml"reduces reallocs slightly.No change required; micro-optimization.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
include/fluent-bit/flb_config.h(1 hunks)plugins/in_calyptia_fleet/in_calyptia_fleet.c(12 hunks)src/flb_config.c(1 hunks)src/flb_reload.c(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
plugins/in_calyptia_fleet/in_calyptia_fleet.c (2)
src/flb_sds.c (1)
flb_sds_destroy(389-399)include/fluent-bit/flb_mem.h (1)
flb_free(126-128)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-centos-7
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: PR - fuzzing test
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
🔇 Additional comments (6)
src/flb_reload.c (1)
551-553: Good: mark success on the new config after a clean start.Setting
new_config->hot_reload_succeeded = FLB_TRUE;is the right signal for post-reload commit logic.src/flb_config.c (1)
310-311: LGTM: sane default.Initializing
config->hot_reload_succeeded = FLB_FALSE;avoids indeterminate reads.plugins/in_calyptia_fleet/in_calyptia_fleet.c (4)
576-591: exists_fleet_config: OK, but avoid TOCTOU assumptions in callers.The deref +
access()check is fine as a hint, but callers should be tolerant to the file disappearing between check and use (most are). No change requested.
1197-1210: Confirm HTTP-date parsing matches server format.You parse Last-Modified with
"%a, %d %B %Y %H:%M:%S GMT". RFC 7231 uses abbreviated month (%b). If the server sends “Wed, 21 Oct 2015 07:28:00 GMT”,%Bwill fail. If this format was already validated in production, ignore; otherwise consider switching to%bwith a fallback.
2149-2177: Startup load prefers cur then old: good failover.Nice improvement: avoids promoting an uncommitted “new” on process start.
2419-2535: Good: cleanup uncommitted 'new' on startup; ensure delete safety shares the same base-dir validation.This path calls
calyptia_config_delete_by_ref(ctx, "new"); once the delete hardening (base-dir canonicalization) is in place, this startup cleanup becomes safe even under ref corruption.
pwhelan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
How this interacts with #10869 might necessitate a refactoring but that can be handled in that PR.
This fixes a fleet bug where fetching an invalid config then shutting down prevented the agent from starting up again.
NB for people who reviewed this as an earlier draft: this is equivalent to the now-closed PR in my fork repo.
Problem
The fleet plugin is buggy today in that it will immediately commit newly received config as the current config, implying the new config is valid. This can notably prevent fluent-bit from restarting correctly after receiving an invalid config.
For example:
Change
This PR changes how configs received by the fleet plugin are committed as valid.
When a new config is received:
When a process starts up:
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
New Features
Refactor